If you are looking to learn about clustering and have some prior experience with data science and R, this article is for you! We are three graduate students at the University of Virginia School of Data Science and we are here to explain the basics of clustering as well as some applications and examples in an easily digestible format.
Many data science problems revolve around prediction. These problems use training data where the response variable is known and attempt to build a model to predict the response when it is not known. However, some data science problems seek simply to group the data points according to some underlying structure or natural categories that are unknown. These situations are referred to as unsupervised learning because, even though we do not know the true class labels, we still seek to sort the data into groups. One of the most common approaches to this problem is clustering. There are a variety of clustering methods, which we will dig into later, but they all share the same purpose of sorting the data into groups based on features or metrics.
Clustering is often used to gain insight into the data by revealing patterns and similarities. Data scientists can use the groupings produced by clustering to identify common features of subgroups that may be useful to their domain-specific problem.
Another distance metric that is used is Jaccard distance, which is a measure of how dissimilar two sets are. This metric is an example of a distance used in set theory rather than on a Cartesian plane. We can calculate the Jaccard distance by \(d_J(A,B) = 1 - \frac{|A\cap B|}{|A\cup B|} = 1 - \frac{|A\cap B|}{|A| + |B| - |A\cap B|}\).
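As a quick sketch, this formula translates into a few lines of R (the helper name jaccard_distance is our own):

## A small sketch of the Jaccard distance in R (the helper name is ours)
jaccard_distance <- function(A, B) {
  1 - length(intersect(A, B)) / length(union(A, B))
}
jaccard_distance(c("a", "b", "c"), c("b", "c", "d"))  # 2 shared of 4 total, so 1 - 2/4 = 0.5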
Since the distance metric plays a key role in clustering, we want comparisons of distances across different features to be on the same scale. Thus, it is important to standardize the data before clustering. The most common approach is to standardize each feature by taking each observation value, subtracting the mean of that feature, and dividing by its standard deviation. This allows for a fairer comparison of distances while preserving the proportional distances within each feature. The resulting value is the number of standard deviations the observation falls from the mean.
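In R, this is exactly what the built-in scale() function does; a quick sketch:

## Standardizing a feature by hand versus with scale()
x <- c(10, 20, 30, 40, 50)
(x - mean(x)) / sd(x)  # number of standard deviations each value falls from the mean
as.vector(scale(x))    # scale() performs the same centering and scaling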
Another important concept to master regarding clustering is hard versus soft classification. Hard classification is a natural way to think about clustering because it takes each data point and assigns it to a cluster or group; every data point receives a single cluster label. Soft classification, on the other hand, assigns each data point a probability of belonging to each cluster rather than just a single label. In many methods, soft classification happens behind the scenes of hard classification, and the data point is assigned to the cluster to which it has the greatest probability of belonging. However, it can be useful not to assign these hard labels and instead to look at the probabilities produced by soft classification. By doing so, we are better able to account for the uncertainty in our clustering results.
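A tiny sketch of the relationship, using a hypothetical probability matrix z with one row per point and one column per cluster:

## Hypothetical soft assignments: one row per point, one column per cluster
z <- matrix(c(0.70, 0.20, 0.10,
              0.40, 0.35, 0.25),
            nrow = 2, byrow = TRUE)
apply(z, 1, which.max)  # hard labels; note how little separates point 2's top two clusters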
Introduction: K-means clustering is a popular method for unsupervised learning due to its speed of computation relative to other clustering methods. At a high level, k-means is an iterative algorithm that assigns each point to the cluster with the nearest centroid, updates the cluster centroids based on the points now belonging to each cluster, and repeats this process until no point changes clusters between iterations. To accomplish this, we must start with a set of cluster centroids, meaning we must also decide in advance how many clusters we would like to end up with. This is one of the downsides of k-means clustering; however, a few heuristics can be used to select an “optimal” value of k based on the results of the clustering, as sketched below.
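One common heuristic is the “elbow” method: run k-means for a range of k values and look for the k at which the total within-cluster sum of squares stops dropping sharply. A minimal sketch, using stand-in data of our own rather than any particular dataset:

## Elbow heuristic sketch: plot total within-cluster SS for a range of k
set.seed(42)                       # seed is our choice, for reproducibility
X <- matrix(rnorm(200), ncol = 2)  # stand-in data; substitute your own matrix
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")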
Approaches: Before examining further coding details, it is important to separate our high-level understanding of k-means from its actual implementation. We will walk through two of the most prominent k-means clustering algorithms to highlight the slight differences between the explanation provided above and the most popular modern implementation.
Lloyd’s Algorithm (The Classical Approach):
The first algorithm we will discuss is Lloyd’s Algorithm. Its implementation is essentially the general outline described above. The algorithm follows these steps (a minimal R sketch follows the list):
1. Choose k initial centroids, for example by picking k points at random.
2. Assign each point to the cluster with the nearest centroid.
3. Recompute each centroid as the mean of the points currently assigned to its cluster.
4. Repeat steps 2 and 3 until no point changes clusters between iterations.
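Below is a minimal sketch of Lloyd’s algorithm in R; the function name lloyd_kmeans is our own, and the code is for illustration only (for example, it does not handle clusters that become empty):

## A minimal sketch of Lloyd's algorithm (illustration only, not production code)
lloyd_kmeans <- function(X, k, max_iter = 100) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # random start
  assignment <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    ## Step 2: distance from every point to every centroid; assign to the nearest
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break  # converged: no point moved
    assignment <- new_assignment
    ## Step 3: recompute each centroid as the mean of its assigned points
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

## Usage on stand-in data
set.seed(1)
X <- matrix(rnorm(100), ncol = 2)
lloyd_kmeans(X, k = 3)$cluster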
Caution: There is a problem with this algorithm: depending on the choice of random start, it has a tendency to fall into a local optimum. This stems from the fact that points are assigned to the nearest centroid each time, without accounting for how that move changes the relationship between a cluster’s centroid and the points that comprise it. This is why another algorithm, Hartigan-Wong k-means, is typically used today.
Hartigan-Wong (The Modern Implementation):
In their implementation of k-means, Hartigan and Wong propose altering how points are assigned to new clusters. Instead of simply assigning points to the cluster with the nearest centroid, the iterative process evaluates whether a point should be reassigned based on how the move affects the total within-cluster sum of squares. Each observation is examined by removing it from its current cluster, updating the centroid of the cluster it was removed from, and then choosing the cluster assignment that minimizes the total within-cluster sum of squares. This means that a point will not necessarily be assigned to the cluster with the closest centroid. The steps of the algorithm, in a clearer list:
1. Choose k initial centroids and assign each point to the cluster with the nearest centroid.
2. Take each observation in turn: remove it from its current cluster and update that cluster’s centroid.
3. Assign the observation to whichever cluster yields the smallest total within-cluster sum of squares, updating the affected centroids.
4. Repeat steps 2 and 3, cycling through the observations until no observation changes clusters in a full pass.
Conclusion: For both algorithms, convergence to a solution is guaranteed, though that solution may be a local optimum rather than the global optimum. Even so, the Hartigan-Wong algorithm is generally preferred for k-means because it is harder for it to get stuck in a local optimum: it considers the impact of removing a point from its existing cluster before moving it.
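In R, this distinction is exposed directly: base R’s kmeans() uses Hartigan-Wong by default and accepts Lloyd’s algorithm as an option. A quick sketch on stand-in data of our own:

## Base R's kmeans() defaults to Hartigan-Wong; Lloyd is available for comparison
set.seed(42)                       # seed is our choice, for reproducibility
X <- matrix(rnorm(200), ncol = 2)  # stand-in data for illustration
km_hw <- kmeans(X, centers = 3, nstart = 25)                       # Hartigan-Wong (default)
km_ll <- kmeans(X, centers = 3, nstart = 25, algorithm = "Lloyd")  # classical Lloyd
c(HartiganWong = km_hw$tot.withinss, Lloyd = km_ll$tot.withinss)   # compare total within-cluster SS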
Hierarchical clustering is a clustering approach that uses a distance or similarity metric, together with a linkage method, to form clusters between points in n-dimensional space. The distance or similarity metric measures how close two points are to one another. The metric most people are familiar with is Euclidean distance, as it is the one most commonly taught; however, we will explore additional metrics, along with their benefits and drawbacks, in the following section. The linkage method is then used to define a single distance for an entire cluster, because hierarchical clustering often involves joining a single point to an existing cluster, as well as joining two existing clusters of points. Once again, there are a variety of linkage methods, which will also be examined in greater depth below.
The particular type of hierarchical clustering that will be explored is called Agglomerative Clustering, which is the grouping of clusters from the bottom-up. This bottom-up approach means that each point begins as its own cluster and is then grouped together with other points until all points are enclosed in a single cluster. Let’s walk through an initial example before examining the various types of distance and linkage methods.
You have received the following dataset about a number of restaurants, tracking the number of customers each serviced in the morning and in the afternoon. You are tasked with grouping together restaurants that are similar to one another. The data is as follows:
| Restaurant Name | Morning Customers | Afternoon Customers |
|---|---|---|
| Cup of Joe | 150 | 0 |
| Jonny’s B&B | 100 | 10 |
| McDonald’s | 125 | 1000 |
| Corner Tavern | 10 | 750 |
| Burger King | 100 | 850 |
| Brian’s Spirits | 5 | 500 |
To perform hierarchical clustering, we are going to calculate the pairwise distances between all points. For the sake of simplicity, we will use Euclidean distance, the familiar straight-line distance formula, computed on the standardized features (recall the discussion of standardization above; without it, the Afternoon counts would dominate the distances).
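A short sketch of this calculation, rebuilding the table above as a data frame and using scale() and dist():

## Restaurant data from the table above
restaurants <- data.frame(
  row.names = c("Cup of Joe", "Jonny's B&B", "McDonald's",
                "Corner Tavern", "Burger King", "Brian's Spirits"),
  Morning   = c(150, 100, 125, 10, 100, 5),
  Afternoon = c(0, 10, 1000, 750, 850, 500)
)
## Pairwise Euclidean distances on the standardized features
round(dist(scale(restaurants)), 4)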
Hierarchical clustering begins by merging the two nearest points, which we can also consider to be one-point clusters:
From the distance matrix, we can see that the closest two clusters are McDonald’s & Burger King, which our Euclidean distance calculation on the standardized data places 0.5416 units apart (Corner Tavern & Brian’s Spirits follow at 0.5877 units).
These two restaurants are now considered to be in a cluster together, so for subsequent merges we must decide how to represent the distance to a cluster that is comprised of more than one point. This is where our choice of linkage method comes in. For this example, we will use Single Linkage, which represents the distance between two clusters as the smallest distance between a point in one cluster and a point in the other (essentially, take all pairwise distances between points in the two clusters and use the minimum of these distances as the distance between the clusters).
In the figure above, we illustrate with the red line that the distance from the Corner Tavern cluster to the cluster that now includes both Burger King and McDonald’s is represented by the distance from Burger King to Corner Tavern. This is because Corner Tavern is closer to Burger King than to McDonald’s.
We repeat this same process, merging the two clusters with the smallest distance between them, until we have one cluster that contains all of the points.
We can visualize the clustering using something called a dendrogram. A horizontal bar is drawn connecting two clusters at the moment they are joined, at a height equal to their level of dissimilarity (represented by distance in this case). Following the example above, we can see the process for creating the dendrogram as we cluster.
Note that the lowest black horizontal bar is the one that connects McDonald’s and Burger King and occurs at a height of 0.5416. This is the way a cluster is represented in a dendrogram and is equivalent in concept to the black oval placed around the first cluster in the previous image. The lower the black horizontal bar in a dendrogram, the earlier the two touching clusters are joined, and the height at which two clusters are joined represents the distance between them.
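The whole single-linkage process, dendrogram included, can be reproduced with R’s hclust(), reusing the restaurants data frame from the sketch above:

## Single-linkage hierarchical clustering on the standardized restaurant data
hc <- hclust(dist(scale(restaurants)), method = "single")
plot(hc, ylab = "Distance")  # merge heights correspond to the distances described above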
Caution: Even if we keep the same method for calculating distance, changing our linkage method can change the results of the clustering. Below is an example of using another linkage method, Complete Linkage, when clustering our data. Where our choice of Single Linkage merges Clusters 1 and 2 first, Complete Linkage would merge Clusters 1 and 3 first.
This concept finally brings us to a discussion on the various types of linkage methods, how the specific methods work, and why they might be used in particular scenarios. While this is not an exhaustive list, it does provide many of the most widely applicable methods:
The Complete Linkage Method represents the distance between two clusters as the largest pairwise distance between a point in one cluster and a point in the other, the maximum counterpart to Single Linkage’s minimum. Generally, Complete Linkage leads to compact clusters, meaning that points within the same cluster are often relatively close to one another.
The Average Linkage Method represents the distance between two clusters as the average distance from any point in the first cluster to any point in the second. Essentially, all pairwise distances are calculated across the two clusters, and the average of these distances is taken as the overall distance between the clusters.
Generally, Average Linkage leads to clusters of points that are neither as compact as Complete Linkage nor as far apart as they are in Single Linkage.
Centroid Linkage takes the average of the points in each cluster, giving us a centroid per cluster, and then uses the distance between the centroids to represent the distance between the two clusters. Centroid Linkage can be easily confused with Average Linkage, since both take an average; the key difference is when the average is taken. To reiterate: Centroid Linkage averages the points first and then takes the distance between the centroids, whereas Average Linkage takes the pairwise distances between points across the clusters and then averages all such distances, as the short sketch below illustrates.
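The distinction is easy to see in a few lines of R, using two small hypothetical clusters of our own:

## Two small hypothetical clusters of 2-D points
A <- matrix(c(0, 0,
              0, 2), ncol = 2, byrow = TRUE)
B <- matrix(c(5, 0), ncol = 2, byrow = TRUE)
## Centroid linkage: average the points first, then take one distance
sqrt(sum((colMeans(A) - colMeans(B))^2))    # ~5.10
## Average linkage: take all cross-cluster distances, then average them
mean(as.matrix(dist(rbind(A, B)))[1:2, 3])  # ~5.19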
Generally, Centroid Linkage is used when the centroids of the clusters have a particular meaning or when ease of interpretation is needed, and it leads to results similar to Average Linkage. Caution: Despite its simplicity, one issue to note with Centroid Linkage is the possibility of inversion. Inversion occurs when two clusters turn out to be more similar to one another than two clusters that have already been merged, so a later merge happens at a lower height than an earlier one. The previous three methods guarantee that there will be no inversion; Centroid Linkage introduces this possibility.
Ward’s Linkage turns out to extend Centroid Linkage in its calculation; however, it should be thought of as representing the distance between two clusters as the increase in the total within-cluster variation when the two clusters are merged. When we simplify the formula, the distance between two clusters, Cluster A and Cluster B, works out to the following:
\(\frac{N_A \, N_B}{N_A + N_B} \, d^2(C_A, C_B)\), where \(N_A\) is the number of observations in Cluster A, \(N_B\) is the number of observations in Cluster B, \(C_A\) and \(C_B\) are the centroids of Clusters A and B, respectively, and \(d^2\) is the squared Euclidean distance between two points (note this is not standard Euclidean distance).
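This translates directly into R; a sketch with a helper name of our own, reusing the hypothetical clusters A and B from the sketch above:

## Ward's linkage distance between two clusters of points (helper name is ours)
ward_distance <- function(A, B) {
  nA <- nrow(A); nB <- nrow(B)
  (nA * nB) / (nA + nB) * sum((colMeans(A) - colMeans(B))^2)  # squared Euclidean between centroids
}
ward_distance(A, B)  # ~17.33 for the clusters A and B defined earlier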
Thus, Ward’s Linkage can sometimes be similar to Centroid Linkage in terms of its compactness; however, its acknowledgement of the change in within-cluster variation means that the results are not always the same. Additionally, Ward’s Linkage Method is monotonic, meaning that inversion will not be an issue as it is in Centroid Linkage clustering.
An Interesting Connection to K-Means: Ward’s Linkage Method adopts an approach similar to the Hartigan-Wong k-means algorithm for deciding merges: using total within-cluster variation to update cluster assignments. The large difference, however, is that Ward’s Linkage calculates the change in total within-cluster variation from merging all observations of two existing clusters, rather than evaluating the fit of a single observation in a cluster. The point we make here is that both use the same concept to update clusters, though each implements it in its own way. This is also a friendly reminder that hierarchical clustering makes assignments sequentially, continuing to build upon earlier assignments until all observations fall under the same overarching cluster, whereas k-means iterates through the observations (after initially assigning all observations to a cluster), continuously evaluating whether a point’s placement in another cluster would reduce the total within-cluster variation compared to its current assignment.
Here we explore a sample dataset based on data from the National Institute of Standards and Technology. The data consists of tabular autosomal Short Tandem Repeats (STRs) at 24 loci, or locations on the genome.
We have separated these profiles into 3 population reference groups: POP1, POP2, and POP3. We want to identify if the STR profiles cluster into distinct population groups.
First, we read in the data, remove NAs and sample 75 observations from each population group to ensure even sample sizes.
library(tidyverse)
set.seed(211)
# setwd("path/to/data")  # set the working directory to the folder containing the CSV
df <- read.csv("Sample_GenePop_Data.csv")
df <- na.omit(df) # remove NA's
## Sample 75 observations from each population
pop1tmp <- df[which(df$Pop=='POP1'), ]
pop2tmp <- df[which(df$Pop=='POP2'), ]
pop3tmp <- df[which(df$Pop=='POP3'), ]
pop1sample <- pop1tmp[sample(nrow(pop1tmp), size=75), ]
pop2sample <- pop2tmp[sample(nrow(pop2tmp), size=75), ]
pop3sample <- pop3tmp[sample(nrow(pop3tmp), size=75), ]
newdf <- rbind(pop1sample, pop2sample, pop3sample)
newdf[1:5,1:10]
## Pop CSF1PO CSF1PO.2 D10S1248 D10S1248.2 D12S391 D12S391.2 D13S317
## 28 POP1 10 12 12 12 17 18 11
## 63 POP1 11 12 14 15 16 21 10
## 31 POP1 10 13 12 13 18 18 12
## 112 POP1 8 10 14 15 17 19 12
## 229 POP1 9 11 12 13 19 19 11
## D13S317.2 D16S539
## 28 12 9
## 63 11 11
## 31 12 9
## 112 13 12
## 229 14 11
library(plyr) # for mapvalues()
library(ape)  # for dist.gene()
Next, we reduce the DNA profile data to two dimensions using multidimensional scaling (MDS).
The distance between profiles is first calculated using the dist.gene() function from the package ape. Based on the documentation, this function “computes a matrix of distances between pairs of individuals from a matrix or a data frame of genetic data.” It takes a genetic data frame as input, and with method="pairwise" it calculates the pairwise distance between profiles in the data frame. The distance d is the number of loci at which two profiles differ; its variance is d(L-d)/L, where L is the total number of loci.
Multidimensional scaling then takes this set of dissimilarities between profiles and returns a set of points x and y such that the distances between the (x,y) points are approximately equal to the dissimilarities between profiles.
## Set seed for reproducibility
set.seed(211)
## Calculate distance for pairwise DNA profiles
dX = dist.gene(newdf[,-1], method="pairwise")
## Run multidimensional scaling to obtain x and y coordinates from tabular DNA profile data
fit <- cmdscale(dX, eig=TRUE, k=2) # k is the number of dim
## Split out dimensions to x and y points
newcluster <- fit$points
## Add in population reference group
newcluster <- cbind(as.data.frame(newcluster), Pop=newdf$Pop)
Before running any clustering algorithms, the true population distribution can be seen below.
ggplot(newcluster) + geom_point(aes(x=V1, y=V2, color=Pop)) + labs(color='True Population')
For our first clustering approach, we run k-means clustering with K=3 on the MDS points to see how well they cluster into three distinct groups. We can assess the accuracy using a confusion matrix comparing the cluster groups against the true population groups.
## Run k means
km = kmeans(newcluster[1:2], centers=3, nstart=25) # choose K=3
## Produce confusion matrix
tab <- table(true=newcluster$Pop, est=km$cluster)
## Map cluster numbers to appropriate population group
mapping1 <- apply(tab,2,which.max)
groups <- mapvalues(km$cluster, c(1,2,3), mapping1)
## Plot results of kmeans colored by cluster
ggplot(newcluster) + geom_point(aes(x=V1, y=V2, color=as.factor(groups))) + labs(color='Cluster')
## Final accuracy table with correct classification
t <- table(newcluster$Pop, groups)
t
## groups
## 1 2 3
## POP1 69 0 6
## POP2 2 67 6
## POP3 10 10 55
## Accuracy
sum(diag(t)) /sum(t)
## [1] 0.8488889
The accuracy of K means clustering on the multidimensionally scaled data with K=3 is 0.85.
Next, we run hierarchical clustering on the multidimensionally scaled data using the Euclidean distance metric.
We run the hierarchical clustering with 5 different linkage methods: complete, single, centroid, average, and ward.D2.
## Set seed for reproducibility
set.seed(211)
## Calculate euclidean distance for MDS points
dX1 = dist(newcluster[1:2], method="euclidean")
## Run hierarchical clustering for all 5 linkage methods
hc1 = hclust(dX1, method="complete")
hc2 = hclust(dX1, method="single")
hc3 = hclust(dX1, method="centroid") # note: hclust's documentation recommends squared Euclidean dissimilarities for centroid linkage
hc4 = hclust(dX1, method="average")
hc5 = hclust(dX1, method="ward.D2")
## Plot all 5 dendrograms
plot(as.dendrogram(hc1), las=1, main='Complete linkage')
plot(as.dendrogram(hc2), las=1, main='Single linkage')
plot(as.dendrogram(hc3), las=1, main='Centroid linkage')
plot(as.dendrogram(hc4), las=1, main='Average linkage')
plot(as.dendrogram(hc5), las=1, main='Ward linkage')
Of the five, the ward.D2 dendrogram shows the cleanest separation into three groups, so we cut that dendrogram at height 90 to evaluate the 3 resulting clusters against the true population groups.
set.seed(211)
## Cut the dendrogram at height 90
clusters <- cutree(hc5, h=90)
## Produce confusion matrix
tab2 <- table(newdf$Pop, clusters)
## Map cluster numbers to appropriate population group
mapping2 <- apply(tab2,2,which.max)
groups2 <- mapvalues(clusters, c(1,2,3),mapping2)
## Plot hierarchical clustering colored by cluster
ggplot(newcluster) + geom_point(aes(x=V1, y=V2, color=as.factor(groups2))) + labs(color='Cluster')
## Final confusion matrix with correct classification
t <- table(newdf$Pop, groups2)
t
## groups2
## 1 2 3
## POP1 71 0 4
## POP2 7 64 4
## POP3 25 3 47
## Accuracy
sum(diag(t)) /sum(t)
## [1] 0.8088889
The accuracy of hierarchical clustering on the multidimensionally scaled data using Euclidean distance and Ward linkage is 0.81.
Finally, we look at Gaussian Mixture Modeling (GMM), and plot the resulting clusters.
library(mclust)
## Set seed for reproducibility
set.seed(211)
## Run GMM on the MDS points; Mclust selects the model and number of components by BIC by default
mix = Mclust(newcluster[1:2], verbose=FALSE)
## Plot resulting clusters and uncertainty
plot(mix, what='classification')
plot(mix, what='uncertainty')
The first plot is the raw clustering of the x and y coordinates obtained from MDS, with each group’s fitted mean and covariance ellipse. The second plot shows the uncertainty of each data point, represented by point size (smaller points are the most certain; larger points have higher uncertainty).
## Find the most likely cluster for each point from the GMM probabilities (equivalent to mix$classification)
preds <- apply(mix$z,1,which.max)
## Produce confusion matrix against true population groups
tab3 <- table(newdf$Pop,preds)
## Map cluster numbers to appropriate population group
mapping3 <- apply(tab3,2,which.max)
groups3 <- mapvalues(preds, c(1,2,3), mapping3)
## Final confusion matrix with correct classification
t <- table(newdf$Pop, groups3)
t
## groups3
## 1 2 3
## POP1 63 0 12
## POP2 2 64 9
## POP3 9 3 63
## Accuracy
sum(diag(t)) /sum(t)
## [1] 0.8444444
The accuracy of GMM on the multidimensionally scaled data is 0.84.
Hopefully this article has laid a solid foundation for using clustering in your data science work and projects! We covered the basic concepts of clustering and described some applications. We also dug deeper into three of the most common clustering techniques: k-means, hierarchical clustering, and Gaussian mixture models. Through these in-depth explanations and the step-by-step examples that followed, you should be ready to implement and explain these clustering algorithms yourself. Thank you for reading and good luck in your future work!
Aru, Sanya. What Are the Benefits of Customer Segmentation for Your eCommerce Business?, 2020. https://makewebbetter.com/blog/benefits-customer-segmentation-higher-profitability-holiday-sales/
Carbajal, Maria. Separation and acquisition of two languages in early childhood: A multidisciplinary approach, 2020. https://hal.archives-ouvertes.fr/tel-01948483/
Frost, Jim. Standardization, Statistics by Jim, 2020. https://statisticsbyjim.com/glossary/standardization/
Gohrani, Kunal. Different Types of Distance Metrics Used in Machine Learning, 2019. https://medium.com/@kunal_gohrani/different-types-of-distance-metrics-used-in-machine-learning-e9928c5e26c7
Gorur, Dilan. Introduction to Clustering, University of California Irvine, 2011. https://www.math.uci.edu/icamp/summer/research_11/gorur/clustering_iCAMP.pdf
Kaiser, Jocelyn. A judge said police can search the DNA of millions of Americans without their consent. What’s next?, American Association for the Advancement of Science, 2019. https://www.sciencemag.org/news/2019/11/judge-said-police-can-search-dna-millions-americans-without-their-consent-what-s-next
Kestemont, Michael, et al. Overview of Author Identification Task at PAN-2018, 2018. http://ceur-ws.org/Vol-2125/invited_paper_2.pdf
Machine Learning in Action. Email Spam Filtering: An Implementation with Python and Scikit-learn, KDnuggets. https://www.kdnuggets.com/2017/03/email-spam-filtering-an-implementation-with-python-and-scikit-learn.html
Malliaros, Fragkiskos and Michalis Vazirgiannis. Clustering and Community Detection in Directed Networks, Cornell University, 2013. https://arxiv.org/abs/1308.0971
Maynard, John. 4 ways government program managers can solve the fraud Catch-22, 2019. https://gcn.com/articles/2019/06/03/fraud-analytics.aspx
Morissette, Laurence, and Sylvain Chartier. “The k-means clustering technique: General considerations and implementation in Mathematica.” Tutorials in Quantitative Methods for Psychology 9.1 (2013): 15-24.
Porter, Michael. Clustering, University of Virginia, 2020. https://mdporter.github.io/SYS6018/lectures/06-clustering.pdf
Rosebrock, Adrian. What is the Jaccard Index?, 2016. https://deepai.org/machine-learning-glossary-and-terms/jaccard-index
Sahu, Beeren. Explained: Deeplab for Semantic Segmentation, 2019. https://beerensahu.wordpress.com/2019/02/07/explained-deeplab-for-semantic-segmentation/
Slonim, N. et al. “Hartigan’s K-Means Versus Lloyd’s K-Means - Is It Time for a Change?” IJCAI (2013).
Tibshirani, Ryan. “Clustering 3: Hierarchical clustering (continued); choosing the number of clusters.” Data Mining: 36-462/36-662, January 31, 2013. Carnegie Mellon University. PDF Presentation.
Whang, Joyce and Dhillon, Inderjit. Overlapping Community Detection in Massive Social Networks, Center for Big Data Analytics. https://bigdata.oden.utexas.edu/project/graph-clustering/